A PARALLEL PIPELINED RENDERER FOR TIME-VARYING VOLUME DATA 


TZI-CKER CHIUEH AND KWAN-LIU MA 


Abstract. 

This paper presents a strategy for efficiently rendering time-varying volume data sets on a distributed-memory
parallel computer. Time-varying volume data take large storage space, and visualizing them requires
reading large files continuously or periodically throughout the course of the visualization process. Instead
of using all the processors to collectively render one volume at a time, a pipelined rendering process is
formed by partitioning processors into groups to render multiple volumes concurrently. In this way, the
overall rendering time may be greatly reduced because the pipelined rendering tasks are overlapped with
the I/O required to load each volume into a group of processors; moreover, parallelization overhead may
be reduced as a result of partitioning the processors. We modify an existing parallel volume renderer to
exploit various levels of rendering parallelism and to study how the partitioning of processors may lead to
optimal rendering performance. Two factors that are important to the overall execution time are resource
utilization efficiency and pipeline startup latency. The optimal partitioning configuration is the one that
balances these two factors. Tests on Intel Paragon computers show that, in general, optimal partitionings do
exist for a given rendering task and result in a 40-50% saving in overall rendering time. 

Key words. direct volume rendering, parallel rendering, pipelining, time-varying data, MPP computers. 

Subject classification. Computer Science 

1. Introduction. Time-varying volumetric data sets (TWD), which may be obtained from numerical
simulations or sensing instruments, provide scientists insights into the detailed dynamics of the phenomenon
under study. When appropriately rendered, they form an animation sequence that can illustrate how the
underlying structures evolve over time. For visualizing large data sets, parallel processing is often used to
speed up the expensive volumetric rendering process. Although the subject of rendering a single volumetric
data set using a parallel computer has been studied extensively by numerous researchers [17, 16, 14, 22, 10],
parallel animation of TWD, in contrast, has received relatively little attention. 

Compared to parallel volume rendering of a single data set, rendering TWD in parallel poses a different 
set of design tradeoffs. First, because TWD typically consists of a sequence of data volumes, the I/O 
overhead to bring the data into the parallel machines accounts for a significant portion of the end-to-end 
response time, and can no longer be ignored as is done in many analyses of parallel volume rendering. The 
key technique to address this I/O problem is to hide the I/O overhead by overlapping computation with 
I/O. Secondly, since a TWD rendering job is actually comprised of multiple rendering tasks, it is important 
to make efficient utilization of the computation resources so that the overall rendering time is minimized. 
In particular, one should remember that parallelization almost always incurs certain overhead such as data 
distribution, communication of intermediate results, result collection, and synchronization. Therefore it is 
critical to achieve a balance between the parallelism and overhead of individual rendering tasks, with the goal
of optimizing the overall performance of the entire TWD rendering job. Thirdly, whereas in single-data-set
rendering the response time is the single most important criterion, in TWD rendering there are multiple 
criteria that are potentially of interest to the users. One possibility is the start-up latency, the time until the 
first image appears. Another candidate is the overall execution time, the time until the last image appears. 
Depending on the requirements of the end users, different design tradeoffs need to be made to optimize 
different performance criteria. 

We argue that parallel volume animation requires re-thinking of the types of parallelism one should 
exploit to achieve the optimal performance. In particular, I/O overlap and resource utilization efficiency play 
a crucial role in the parallelization strategy. We start with a generic parallel volume rendering program [14], 
modify it to experiment with different approaches for parallel volume animation of time-varying data sets, and
analyze the performance tradeoff among various partitioning strategies. Although the results and analysis 
are based on implementations on an Intel Paragon, we believe that the conclusions should remain valid for 
other parallel distributed memory architectures. 

2. Related Work. Ideally, visualizing time-varying volume data should be done while data are being 
generated, so that users receive immediate visual feedback on the subject under study, and so the visualization 
results can be stored rather than the much larger raw data. VISUAL3 [7] and SCIRun [18] are among the 
many software systems that can support runtime tracking of three-dimensional numerical simulations. These 
systems may be operated in a distributed computing environment. Rowlan [19] and Ma [12] also demonstrate 
such tracking capability using direct volume rendering on a massively parallel computer. However, runtime
tracking is not always possible or desirable for certain applications. For example, one may want to explore
the data set from different perspectives; or the computation power required for real-time rendering
or a special visualization technique may not be readily available. As a result, postprocessing of precalculated
data remains an important requirement. 

Several techniques have been developed for visualizing time-varying data as a postprocess. Lane [11]
developed a particle tracer for three-dimensional time-dependent flow data. Max and Becker [15] apply
textures for visualizing both steady and unsteady flow fields. Silver and Wang [23] present a volume-based
feature tracking algorithm to help visualize and analyze large time-varying data sets. More recently, Jaswal
demonstrates distributed real-time visualization of time-varying data using a CAVE [8]. He identifies
I/O as the single most constraining factor in the level of interactivity and suggests performing various types
of filtering to reduce the amount of data sent and rendered. 

More closely related to our work is the ray-cast rendering strategy introduced by Shen and Johnson [21] 
which they call differential volume rendering. By exploiting the data coherency between consecutive time 
steps, they are able to reduce not only the rendering time but also the storage space by 90% for their two 
test data sets. Differential volume rendering is potentially parallelizable and a caching technique [13] may 
be integrated into the renderer to avoid recalculations for visualizing irregular data. Goel and Mukherjee [6] 
also develop an approach similar to Shen and Johnson’s and achieve comparable savings. 

Turning to the I/O issue, the MPI-IO initiative [1] represents an effort to develop a standard for portable 
parallel I/O. Even with the presence of parallel I/O, we cannot guarantee that I/O time becomes less 
dominant, especially when processor technology is advancing at a faster pace than I/O technology. In fact, 
the strategy we develop in this research can be used in conjunction with parallel I/O to achieve maximum 
performance. 

There has also been previous research investigating the I/O characteristics of graphics and visualization 
applications on parallel computers [5, 20]. Chiueh [3] presented a memory access algorithm that allows
conflict-free access to an interleaved memory system that stores volumetric data sets. The same algorithm
is directly applicable in the context of parallel disk arrays. The work described here, in contrast, focuses
mostly on resource utilization and parallelism to optimize the overall process of visualizing time-varying
volume data on parallel distributed-memory architectures. We also want to investigate the feasibility of
building a volumetric data management system [9, 2] that is easy to use on the one hand, and is capable of 
efficiently interfacing with parallel rendering engines on the other. 

3. Parallelization Approaches. The basic structure of a generic parallel volume rendering program [14]
forms a three-step pipeline: 3D data distribution, in which the volumetric data set is decomposed
into subvolumes and distributed to the processor nodes; subvolume rendering, in which each processor node
renders the assigned subvolume into a 2D subimage; and image compositing, in which the set of 2D subimages
from the previous step are composited according to the view angle to arrive at the final 2D projected
image. When the degree of parallelism is small to modest, i.e., under 16 nodes, the major portion of the
computational overhead is attributed to subvolume rendering. However, when the degree of parallelism is
high or when the data set itself is large (say $1024^3$), 3D data distribution becomes a significant performance
factor. 

Given a generic parallel volume renderer and a P-processor machine, there are three possible approaches
to turn it into a parallel volume animator for TWD sets. The first approach simply runs the parallel volume
renderer on the sequence of data sets one after another. At any point in time, the entire P-processor machine
is dedicated to rendering a particular volume.¹ Therefore, only the parallelism associated with rendering
a single data volume, i.e., intra-volume parallelism, is exploited. The second approach takes the
exact opposite route by rendering P data volumes simultaneously, each on one processor. This approach
thus exploits only inter-volume parallelism. As the optimal system performance can only be achieved by
carefully balancing two performance factors, resource utilization efficiency and parallelization overhead, both
intra-volume and inter-volume parallelism should be exploited. The third approach is a hybrid, in which P
processor nodes are partitioned into L groups (1 < L < P), each of which renders one data volume at
a time. We will show later that the third approach indeed performs the best among the three. However,
the optimal choice of L depends on the type and scale of the parallel machine as well as the size of the data set.
Detailed characterizations of the optimal partitioning strategy are described in Section 5. 
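
The hybrid partitioning can be sketched in a few lines of Python. This is a minimal illustration, not the renderer's actual code: the function names are hypothetical, and the round-robin assignment of volumes to groups is an assumption, since the text only states that each group renders one volume at a time. Setting the number of groups to 1 gives the intra-volume approach, and setting it to P gives the inter-volume approach.

```python
def partition_processors(num_procs, num_groups):
    """Split processor ranks 0..num_procs-1 into num_groups equal-sized groups."""
    if num_groups < 1 or num_procs % num_groups != 0:
        raise ValueError("num_groups must evenly divide num_procs")
    size = num_procs // num_groups  # processors per group, P / L
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def assign_volumes(num_volumes, num_groups):
    """Assumed round-robin schedule: volume j is rendered by group j mod L."""
    schedule = [[] for _ in range(num_groups)]
    for j in range(num_volumes):
        schedule[j % num_groups].append(j)
    return schedule


if __name__ == "__main__":
    # Example: P = 8 processors split into L = 2 groups rendering 6 volumes.
    print(partition_processors(8, 2))
    print(assign_volumes(6, 2))
```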

4. Performance Analysis. 

4.1. Metrics. Parallel volume animation of TWD sets involves rendering multiple data volumes in
a single task. There are three potential performance metrics: start-up latency, the time until the rendered
image of the first volume appears; overall execution time, the time until the rendered image of the last volume
appears; and inter-frame delay, the average time between the appearance of consecutive rendered images. In
conventional volume rendering applications, since only one data set is involved, start-up latency and overall
execution time are the same, and inter-frame delay is irrelevant. However, when volume animation is used
interactively, start-up latency and inter-frame delay play a crucial role in determining the effectiveness of the
system. When volume animation is run in batch mode, overall execution time should be the major concern.
Note that different design tradeoffs have to be made for different performance criteria. For example, if start-up
latency is the criterion of choice, then the first approach discussed in Section 3 should probably be the
design of choice. In the rest of the paper, we use the overall execution time as the main criterion and
only mention the other two when appropriate. 

¹ Here we assume the pipeline effect is ignored. 

4.2. Performance Models. Before we present our experiments, it’s useful to construct a performance 
model for each of the approaches described above so that one can have a basic understanding of the results. 
For the rest of the discussion in this paper, without limiting the applicability of our research results, we 
assume a completely serial I/O system in order to focus on other issues. 

Assume that there are N data volumes in the TWD set, there are P processors in the system, and
without loss of generality $N = k \times P$. Let $t_{or}(p)$ denote the total rendering time for a single data volume
using p processors, including file access and data distribution, rendering, compositing, and image delivery;
$t_{io}(p)$ the time to distribute a data set from the disk to the p processors at the beginning of rendering a
data volume; and $T(L)$ the overall execution time for rendering N data volumes when P processors are
decomposed into L groups, each of which consists of $P/L$ processors. 

For the intra-volume approach, the overall execution time is 

(4.1)    $$T(1) = N \times t_{or}(P)$$

Because P processors are collectively used to render one data volume at a time, the rendering task for the 
j-th volume won’t start until that for the (j - 1)-th volume ends. The timing diagram for this approach is
shown in Figure 1. For the inter-volume approach, the overall execution time is 

(4.2)    $$T(P) = k \times \max\{t_{or}(1),\; P \times t_{io}(1)\} + \min\{t_{or}(1) - t_{io}(1),\; (P - 1) \times t_{io}(1)\}$$

Because each data volume is rendered only by a single processor, there are at most P concurrent rendering
tasks on the system. If $P \times t_{io}(1) > t_{or}(1)$, then the system is IO-bound. That is, the rendering task for the
(P + j)-th volume cannot start immediately after the j-th volume is done. The second term in Equation (4.2)
accounts for the time by which the completion of the N-th volume extends beyond the k rounds counted by the
first term. This is $t_{or}(1) - t_{io}(1)$ when $t_{or}(1) < P \times t_{io}(1)$, because the last volume still has to be processed
after its load completes, or $(P - 1) \times t_{io}(1)$ when $t_{or}(1) > P \times t_{io}(1)$, because the N-th volume starts, and hence
completes, that much later than the (N - P + 1)-th volume. The
timing diagram for the inter-volume approach assuming $t_{or}(1) > P \times t_{io}(1)$ is shown in Figure 2. For the
hybrid approach, assume that P processors are divided into L groups, each of which now contains $P_g = P/L$
processors; then the overall execution time is 

(4.3)    $$T(L) = \frac{N}{L} \times \max\{t_{or}(P_g),\; L \times t_{io}(P_g)\} + \min\{t_{or}(P_g) - t_{io}(P_g),\; (L - 1) \times t_{io}(P_g)\}$$

As can be seen, the performance formula for the inter-volume approach is essentially an instance of that of
the hybrid approach when L = P. Note that whether the rendering task is IO-bound or CPU-bound depends
on the size of the data set as well as the number of processors in the system. 
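
To make the model concrete, the following Python sketch evaluates the hybrid formula for several values of L and picks the smallest predicted time. It is only an illustration under stated assumptions: the second term uses the min form, chosen so that the expression reduces to $N \times t_{or}(P)$ when L = 1; the function name `overall_time` is hypothetical; and the cost values in `T_OR` and `T_IO` are made-up numbers, not measurements from this study.

```python
def overall_time(n_volumes, n_procs, n_groups, t_or, t_io):
    """Predicted overall execution time T(L) for L = n_groups (hybrid model)."""
    if n_procs % n_groups or n_volumes % n_groups:
        raise ValueError("L must divide both P and N")
    p_g = n_procs // n_groups  # processors per group, P_g = P / L
    render, load = t_or(p_g), t_io(p_g)
    # Steady state: each of the N / L rounds is limited either by rendering
    # (CPU-bound) or by loading L volumes over the serial I/O path (IO-bound).
    steady = (n_volumes // n_groups) * max(render, n_groups * load)
    # Tail: extra time until the last volume of the final round completes.
    tail = min(render - load, (n_groups - 1) * load)
    return steady + tail


# Hypothetical per-volume costs (NOT measurements from the paper).
T_OR = {1: 100.0, 2: 54.0, 4: 30.0, 8: 18.0, 16: 12.0}  # t_or(p), includes I/O
T_IO = 4.0                                               # t_io(p), assumed constant

P = N = 16
times = {L: overall_time(N, P, L, T_OR.__getitem__, lambda p: T_IO)
         for L in (1, 2, 4, 8, 16)}
best = min(times, key=times.get)
print(times, "best L =", best)
```

With these made-up costs the predicted minimum falls at an intermediate partitioning (L = 4), which is the qualitative behavior the model is meant to capture; real values of $t_{or}$ and $t_{io}$ would have to be measured.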

5. Test Results. 

5.1. Experiment Setup. An existing parallel volume renderer [14] is modified in such a way that it can
exploit different levels of intra-volume and inter-volume parallelism by varying the configuration parameter
L, the number of processor groups, which determines the number of processors dedicated to a single volume
given that the total number of processors is fixed.
Our tests were run on a 72-node Intel Paragon computer operated by the NASA Langley Research Center
as well as the 512-node Intel Paragon computer at the California Institute of Technology. The data set
is obtained from a time-dependent turbulence simulation and its size is 128x128x128. Snapshots from a
volume-rendered animation of the data showing vorticity magnitude are shown in Figure 3. Image resolution
is 256x256. 

The general structure of the program is shown in Figure 4. Given P processor nodes, there are L virtual
rendering nodes, each of which consists of P/L physical processor nodes. In addition, a host node performs
disk I/O access and volumetric data distribution. The same host node also collects the 2D subimages from
each node to form the resultant image and sends it to the end user over the network. Because multiple
data volumes are being rendered simultaneously, appropriate flow control is needed to maintain
synchronization between the host node and the virtual rendering nodes. These are indicated in Figure 4 as gray
lines going in both directions. Without proper synchronization, subimages from different rendering runs may
become intermixed. For the rest of the discussion, the term “number of processors” refers to the number
of physical processor nodes involved in rendering only, i.e., excluding the I/O and display nodes. Also, the
number of data volumes rendered in each run is made equal to the number of physical processors. We make
this assumption to ensure that the pipeline start-up overhead will be appropriately accounted for in the
performance evaluation. 

Fig. 1. Utilization of system components under the Intra-Volume approach. The numbers denote the data volume number. 

Fig. 2. Utilization of system components under the Inter-Volume approach when $t_{or}(1) > P \times t_{io}(1)$. The number in
each box denotes the data volume number. The number of processors, P, is assumed to be 4. 

Fig. 3. Snapshots from an animation of three-dimensional turbulent shear flow calculations. The numerical model was
developed by Dr. J. Shebalin at the Fluid Mechanics and Acoustics Division of NASA Langley Research Center and the
calculations were done on a Cray YMP. 

Fig. 4. Software architecture of the implemented parallel volume animator. P computation processors are partitioned into
L virtual rendering nodes, each of which is responsible for rendering a single data volume loaded from disk through a host node. 

Fig. 5. The overall execution time versus the number of partitions for three different processor sizes. 


5.2. Results and Analysis. Our conjecture that the optimal performance can only be achieved by
effectively exploiting both intra-volume and inter-volume parallelism is confirmed by Figure 5, which illustrates
the relationship between the overall execution time and the number of processor partitions (L), plotted
on a $\log_2$ scale along the X axis. With 16 processors, the optimal number of partitions for rendering 16 data
volumes is 2 or 4; with 32 processors, the optimal number for rendering 32 volumes becomes 4 or 8; with 64
processors, the optimal number is 8. We want to re-emphasize that the overall execution time shown consists
of three phases: data distribution, which includes both disk I/O and data distribution; rendering, which
includes rendering and compositing; and image display, which includes collecting subimages and transferring
the final image over the network. 

Intuitively, when L = 1, each data volume is rendered one after another, without any overlap between 
different phases from consecutive runs. As a result, the utilization of various system components, as shown 
in Figure 1, is inherently suboptimal. For example, the utilization of the rendering nodes is 


$$\frac{t_{\text{rendering}}}{t_{\text{data-distribution}} + t_{\text{rendering}} + t_{\text{display}}}$$


On the other hand, when L = P, it takes at least P runs for the entire pipeline to become active, as
shown in Figure 2, where P is assumed to be 4. Since we assume there are a total of P data volumes in
the sequence, the pipeline never has a chance to achieve its optimal throughput. Consequently, the overall
execution time is the worst among all possible configurations for a fixed number of processors. It should be
noted, however, that when the number of data volumes in the time-varying data set is much larger than the
number of physical processors so that the start-up overhead can be effectively amortized, the inter-volume
approach should achieve the best overall execution time because it incurs the least parallelization overhead.
Our test results in fact show such a trend in both 16- and 64-processor cases. In practice, this assumption is
not necessarily true: when the data set size exceeds the node memory, the inter-volume approach is simply
not feasible. The optimal partitioning presumably minimizes the start-up overhead while maximizing the
utilization efficiency of the rendering nodes. 

Fig. 6. The overall execution time, start-up latency, and average inter-frame delay (in seconds) versus the number of
partitions, when P = 32. 

As we mentioned earlier, there are multiple performance criteria for parallel volume animation of TWD 
sets. Figure 6 shows the behavior of the three criteria described earlier versus the degree of partitioning, 
and the tradeoff among them. The number of processors in this case is fixed at 32. The start-up latency 
is monotonically increasing with the number of partitions because the number of processors dedicated to a 
single data volume is decreasing. The average inter-frame delay is computed by subtracting the start-up 
latency from the overall execution time and dividing the result by the number of data volumes rendered. 
Because of the dominance of the overall execution time, the inter-frame delay exhibits a curve somewhat similar
to that of the overall execution time. The computed inter-frame delay is almost identical to
the average of the inter-frame delays from actual measurements. Note that the computed average inter-frame
delay does not necessarily correspond to the apparent inter-frame delay that users experience. In general, the
rendered frames come in a burst, stop for a while, and then repeat. The fact that there is a stop period
is symptomatic of an imbalance between the data distribution and rendering phases. It is interesting to
observe that the smoothest rendering, i.e., the one with the shortest stop period between bursts, indeed
occurs under the configuration that has the smallest overall execution time, because it is the most balanced
among system components. For P = 32, $L_{optimal}$ = 4 or 8. 
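
The three criteria can be derived from a list of frame completion times, as in the Python sketch below. It follows the computation described above (overall execution time minus start-up latency, divided by the number of volumes) and also reports the longest gap between consecutive frames as a rough indicator of the stop period between bursts. The function name and the sample timestamps are hypothetical and are not measurements from these experiments.

```python
def animation_metrics(completion_times):
    """Return (start_up_latency, overall_time, avg_inter_frame_delay, longest_stall).

    completion_times holds the time at which each rendered image appears,
    measured from the start of the run.
    """
    if not completion_times:
        raise ValueError("need at least one completed frame")
    times = sorted(completion_times)
    start_up = times[0]    # time until the first image appears
    overall = times[-1]    # time until the last image appears
    # Average inter-frame delay as computed in the text above.
    avg_delay = (overall - start_up) / len(times)
    # Longest pause between consecutive frames (0.0 for a single frame).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    longest_stall = max(gaps, default=0.0)
    return start_up, overall, avg_delay, longest_stall


if __name__ == "__main__":
    # Hypothetical run: two bursts of four frames separated by a long pause.
    print(animation_metrics([10, 11, 12, 13, 30, 31, 32, 33]))
```

In this made-up run the average delay is under 3 time units while the longest stall is 17, which illustrates why the computed average can differ noticeably from the delay a user perceives.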
Table 1 

A breakdown of the rendering time for generating a single frame when using up to 32 processors. 

tasks                       32 nodes  16 nodes   8 nodes   4 nodes   2 nodes    1 node
initialize renderer            0.269     0.654     1.593      3.36      7.02     12.96
ray-cast resample                2.8       5.5       9.5        19        37        64
composite partial images       1.068      1.43      2.32     3.747      5.96      0.00
total time                     4.137     7.584    13.413    26.107     49.98     76.96



Fig. 7. Comparing the measured performance with the predicted performance for the 32-processor case. 


Table 1 gives a more detailed look at the rendering cost, showing the time to generate a single
frame using up to 32 processors. The initialization time is mostly for computing the voxel gradient values
for lighting calculations. This initialization must be done for each volume. Both the initialization and the ray-cast
resampling times decrease roughly in inverse proportion to the number of processors used to render a volume.
The compositing time, which includes both calculation and communication components, also decreases when
more processors are used, except for the one-processor case in which no compositing calculation is needed
after the resampling process. The total rendering time illustrates the increasing parallelization penalty we incur
when using more processors in a partition. Hence, with the same number of processors, rendering multiple
volumes concurrently reduces the aggregate parallelization overhead and gives us better overall throughput. 
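
The parallelization penalty can be quantified directly from the total times in Table 1. The Python sketch below computes speedup and parallel efficiency relative to the one-node time, and an aggregate frame rate for 32 processors split into L groups. The aggregate rate deliberately ignores I/O and image delivery, which is a simplifying assumption of this sketch, so it shows only the rendering-side trend and not the overall optimum.

```python
# Total single-frame rendering times from Table 1, keyed by node count.
TOTAL_TIME = {32: 4.137, 16: 7.584, 8: 13.413, 4: 26.107, 2: 49.98, 1: 76.96}


def speedup(nodes):
    """Speedup over the one-node rendering time."""
    return TOTAL_TIME[1] / TOTAL_TIME[nodes]


def efficiency(nodes):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return speedup(nodes) / nodes


def aggregate_rate(total_nodes, groups):
    """Frames per unit time when `groups` partitions render concurrently.

    Assumes I/O and image delivery are free, so this is an upper bound.
    """
    return groups / TOTAL_TIME[total_nodes // groups]


if __name__ == "__main__":
    for nodes in sorted(TOTAL_TIME):
        print(f"{nodes:2d} nodes: speedup {speedup(nodes):6.2f}, "
              f"efficiency {efficiency(nodes):.3f}")
    for groups in (1, 2, 4, 8, 16, 32):
        print(f"L = {groups:2d}: {aggregate_rate(32, groups):.4f} frames per unit time")
```

Efficiency falls steadily as more nodes share one volume, while the I/O-free aggregate rate keeps rising with L. It is the serial I/O and pipeline start-up costs, which this sketch omits, that push the measured optimum back to an intermediate L.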

Note that the C++ implementation of the renderer precludes us from using the Paragon’s native compiler,
resulting in at least 30% performance degradation. While we may be able to optimize our renderer to obtain
better rendering rates, doing so would make the relative performance degradation due to I/O
delay even more pronounced. 

Finally, in Figure 7, we compare the measured performance with the predicted performance for the 
32-processor case. To predict performance using our model, we must first determine the values of $t_{io}$ and
$t_{or}$. This is done by running rendering jobs using one partition. Figure 7 shows that the measured overall
execution time correlates quite well with the prediction from the model, though some small discrepancies 
occur. Presumably this is because the performance model in Section 4 is stated in terms of delays associated 
with high-level primitives. 
6. Conclusions. Rendering time-varying volumetric data sets poses a different problem than rendering
a single-volume data set. We start with a naive approach that repeatedly executes a generic parallel
volume renderer on the time-varying sequence of 3D data sets, and find that during the beginning and the end
of the rendering process for a single data set, the nodes are mostly idle, thus wasting resources.
To address this problem, we pipeline the rendering tasks for consecutive data sets in the sequence,
essentially exploiting inter-volume as well as intra-volume parallelism. Given a fixed number of processor
nodes and I/O bandwidth, the research question is what the optimal balance is between inter-volume and
intra-volume parallelism. We have implemented a prototype volume renderer that embodies
the idea of pipelined rendering for time-varying data sets. We are able to attain the most effective system
utilization, bounded only by the data distribution overhead. We also identify three possible performance
criteria for evaluating the rendering of TWD sets, and show that different partitioning strategies are needed to optimize
for different criteria. 

Our results show that there indeed exists an optimal partitioning for a given data set and a parallel 
computer configuration. But the optimum depends on such factors as the machine size, the length of TWD 
sequence, and the ratio between computation and communication/IO overheads, which in turn is affected 
by the hardware characteristics and the coherence property of the data set itself. If these hardware and 
data-specific parameters were available, an optimal partitioning could be determined automatically. 

This study also helps us identify the design issues to construct a volumetric data management system 
that can interface with parallel rendering engines efficiently. In this work, we find that a dedicated I/O 
manager plays an important role in improving the overall performance of TWD rendering. It thus seems 
logical to include such an I/O manager in the envisioned volumetric data management system. However, 
there remains the work of developing a sufficiently flexible interface for the I/O manager that can smoothly
integrate with a wide variety of parallel renderers. As part of the volumetric database project, we are also
working on volumetric data compression algorithms [4] that are shown to be “friendly” to volume renderers,
i.e., algorithms that can effectively exploit the coherency properties of volume rendering computation. 

6.1. Future Work. As we mentioned, this approach can be used in conjunction with parallel I/O 
facilities to achieve even better rendering rates. Furthermore, with a good parallel I/O system, the renderer
can also read ahead by keeping multiple buffers at each rendering node: one for the current frame being 
rendered and one for the next frame being read ahead. The read-aheads would then have to use asynchronous 
read requests which return after the read is queued but before it completes. 
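
The read-ahead idea can be illustrated with a two-buffer scheme in Python. This is a sketch under stated assumptions rather than a description of the implemented renderer: a background thread stands in for asynchronous read requests, a semaphore limits the node to two buffers (one being rendered, one being read ahead), and `load_volume` and `render_volume` are hypothetical callables supplied by the caller.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the volume sequence


def render_with_read_ahead(volume_ids, load_volume, render_volume):
    """Render volumes in order while the next volume is loaded in the background."""
    free_buffers = threading.Semaphore(2)  # two buffers: current frame + read-ahead
    ready = queue.Queue()

    def reader():
        try:
            for vid in volume_ids:
                free_buffers.acquire()  # wait until a buffer is free
                ready.put((vid, load_volume(vid)))
        finally:
            ready.put(_DONE)  # always unblock the renderer

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    images = []
    while True:
        item = ready.get()
        if item is _DONE:
            break
        vid, data = item
        images.append((vid, render_volume(data)))
        free_buffers.release()  # the buffer just rendered can be reused
    thread.join()
    return images


if __name__ == "__main__":
    # Toy stand-ins: a "volume" is a short list and "rendering" sums it.
    print(render_with_read_ahead(range(3), lambda v: [v] * 3, sum))
```

Acquiring the semaphore before each load is what keeps memory use bounded at two volumes per node; a plain bounded queue would let the reader hold an extra loaded volume while waiting to enqueue it.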

The current implementation of the renderer may be optimized in two ways. First, it takes a slice-by-slice
broadcasting approach to distribute the volume data set to the processor nodes, which then pick up
the assigned portions of the slices. A more efficient approach is to store 3D subvolumes on the disk, and 
distribute 3D subvolumes to appropriate nodes directly. One advantage of this approach is the reduction of 
intermediate packing/unpacking overhead. Ultimately a database system specifically designed for efficient 
access to volumetric data will be the most desirable solution. 

Second, all processors involved in a rendering run currently have to be either implicitly or explicitly 
synchronized. As a result, additional synchronization overhead is inevitable. An alternative approach is to 
take a dataflow, functionally specialized model in which each processor node receives data packets, performs
a fixed function, and sends them to the next processor node in the logical pipeline. Each piece of data 
travels across the system with a tag to identify the associated volume. With this architecture, there is no 
need to synchronize the processors in a lock-step fashion, thus reducing the synchronization delay. It’s up 
to the final pipeline stage to pull the subimages together and form the final image. All other nodes are 
in an autonomous loop and operate completely independently of one another. Because throughput is more
important than latency for parallel volume animation, this model seems to be a better fit than the current 
implementation. 
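
A toy version of this dataflow model is sketched below in Python. Threads and queues stand in for processor nodes and message passing, each packet carries a tag identifying its volume, and the final stage groups subimages by tag. All names are hypothetical, and summing the pieces is only a placeholder for real image compositing, which depends on the view order.

```python
import queue
import threading

STOP = None  # end-of-stream marker passed down the pipeline


def stage(func, inbox, outbox):
    """One autonomous node: receive a tagged packet, apply a fixed function, forward it."""
    while True:
        packet = inbox.get()
        if packet is STOP:
            outbox.put(STOP)
            return
        tag, payload = packet
        outbox.put((tag, func(payload)))


def run_pipeline(packets, funcs, pieces_per_volume):
    """Push (tag, payload) packets through the stages and assemble results per volume."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for packet in packets:
        queues[0].put(packet)
    queues[0].put(STOP)

    # Final stage: pull subimages together by volume tag; no lock-step barrier is used.
    partial, finished = {}, {}
    while True:
        packet = queues[-1].get()
        if packet is STOP:
            break
        tag, subimage = packet
        partial.setdefault(tag, []).append(subimage)
        if len(partial[tag]) == pieces_per_volume:
            finished[tag] = sum(partial.pop(tag))  # placeholder for compositing
    for t in threads:
        t.join()
    return finished


if __name__ == "__main__":
    # Two volumes (tags 0 and 1), two pieces each, interleaved on purpose.
    packets = [(0, 1), (1, 10), (0, 2), (1, 20)]
    print(run_pipeline(packets, [lambda x: x * 2, lambda x: x + 1], 2))
```

Because every packet carries its volume tag, pieces of different volumes can be interleaved freely in the pipeline without the intermixing problem that the current implementation avoids through explicit synchronization.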